1.6 PREPROCESSING OF THE FASTQ READS
In the previous section, we discussed how to assess the quality of the reads produced by
HTS instruments and how the warnings and failures reported by the quality metrics point
to potential errors and biases. Before moving on to the next step of the data analysis,
these errors and biases should be corrected to avoid incorrect results and misleading
interpretations. In general, there are three common approaches to fixing the problems
revealed by the quality metrics: (i) trimming the ends of the reads, (ii) removing
low-quality reads, and (iii) masking low-quality bases. The choice of approach depends
on the quality problem. In the following, we will discuss the most commonly used
programs for dealing with read quality issues.
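To make the three approaches concrete, the following is a minimal sketch written with awk. It is not how any particular program implements them. It assumes Phred+33 quality encoding and four-line FASTQ records; the helper name process_fastq, the file toy.fastq, and the threshold of 20 are illustrative assumptions only.

```shell
# Toy FASTQ file with two reads (Phred+33: 'I' = Q40, '#' = Q2).
printf '@read1\nACGTACGT\n+\nIIIII###\n@read2\nACGTACGT\n+\n########\n' > toy.fastq

# process_fastq MODE MINQ FILE   (hypothetical helper, for illustration only)
#   trim   : cut bases with quality < MINQ from the end of each read
#   filter : drop reads whose mean quality is < MINQ
#   mask   : replace bases with quality < MINQ by N
process_fastq() {
  awk -v mode="$1" -v minq="$2" '
    # Lookup table: quality character -> Phred score (Phred+33 assumed).
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i - 33 }
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { bases = $0 }
    NR % 4 == 0 {
      qual = $0
      n = length(qual)
      if (mode == "trim") {
        while (n > 0 && ord[substr(qual, n, 1)] < minq) { n-- }
        if (n == 0) { next }          # nothing left after trimming
        bases = substr(bases, 1, n)
        qual = substr(qual, 1, n)
      } else if (mode == "filter") {
        sum = 0
        for (i = 1; i <= n; i++) { sum += ord[substr(qual, i, 1)] }
        if (n == 0 || sum / n < minq) { next }
      } else if (mode == "mask") {
        masked = ""
        for (i = 1; i <= n; i++) {
          if (ord[substr(qual, i, 1)] < minq) { masked = masked "N" }
          else { masked = masked substr(bases, i, 1) }
        }
        bases = masked
      }
      print header
      print bases
      print "+"
      print qual
    }
  ' "$3"
}

process_fastq trim 20 toy.fastq     # shortens read1, discards read2
process_fastq filter 20 toy.fastq   # keeps read1 unchanged, discards read2
process_fastq mask 20 toy.fastq     # keeps both reads, low-quality bases become N
```

Trimming and filtering shorten or discard reads, whereas masking preserves read length and count, which matters when a downstream tool expects reads of equal length.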
One of the most commonly used software packages for processing raw sequence reads in
FASTQ files is FASTX-toolkit [14], which is a collection of command-line programs. The
installation instructions for FASTX-toolkit are available at
“http://hannonlab.cshl.edu/fastx_toolkit/download.html”. We can download and install it
on Linux using the following steps:
Create a directory for the FASTX-toolkit compressed file and change into it:
mkdir fastxtoolkit
cd fastxtoolkit
Download the compressed program file and decompress it:
wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
tar xvf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
Copy the program files from the “bin” directory to “/usr/local/bin” so that they can be
executed from any directory on the computer:
sudo cp ./bin/* /usr/local/bin
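After copying, it can be useful to confirm that the programs are reachable from the shell. The sketch below is one way to do this; check_tools is a hypothetical helper, and the three program names are assumed to be among those shipped in the FASTX-toolkit “bin” directory.

```shell
# check_tools: report which of the named programs are found on PATH.
# Returns 0 if all are found and 1 if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Program names assumed to be part of the FASTX-toolkit binaries.
check_tools fastx_trimmer fastq_quality_filter fastq_masker \
  || echo "Some FASTX-toolkit programs are not on PATH."
```

If any program is reported as missing, check that the copy step succeeded and that “/usr/local/bin” is included in the PATH environment variable.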
FIGURE 1.27 Failed k-mer content.